Univariate Plots

PH345: Winter 2025

Phil Boonstra

Univariate Plots

Univariate plots are used to visualize the distribution of a single variable:

  • What are the typical values?
  • What is the spread?
  • Are there ‘outliers’?

Examples

  • Histograms
  • Boxplots
  • Density plots
  • Barplots
  • Violin plots

Ultra-runner data (Samtleben, 2023)

\(n = 288\) ultra-runners (completing 100km ultra-marathons)

Each runner’s personal best (in hours):

  [1] 14.00  7.60 14.20 14.33 17.00 12.00 16.00 16.16  9.95 17.55 12.50 23.00
 [13] 18.85  8.50 19.33 16.00 18.00 12.00 14.38 15.00 19.25 14.00 14.21 14.91
 [25] 14.50 19.00 18.50 15.00 20.00 12.16 14.82 12.99 13.50 12.98  9.20 10.00
 [37] 13.55 14.00 15.00 14.00 22.00 15.33 15.53 12.26 12.00 13.00 10.77 13.98
 [49] 14.00 16.00 12.05 20.83 14.00 14.30 14.78 13.83 16.00 12.90 19.67 14.00
 [61] 16.67 10.25 15.38 13.35 14.00 22.00  7.15 14.00 12.00 16.00  9.50 15.13
 [73] 12.99 18.77 15.00 11.25 14.00 13.00 14.53 18.75 16.00 14.50 18.66 15.50
 [85] 12.77  9.05 16.30 17.00 22.00  9.50 15.46  8.70 16.75 12.00 14.41 10.50
 [97] 17.00 11.17 15.50 17.00 13.86 20.00 10.45 10.34 13.33 14.50  7.90 11.00
[109] 10.71 12.00 15.36 19.41 14.00  9.00 15.16 12.00 18.81 10.50 12.00 14.00
[121] 15.00  9.00 20.00 21.50 11.33 15.00 21.25 23.00 22.00 18.60 21.90 16.16
[133] 15.50 13.71 23.50 10.33  8.70 18.00 12.83 10.49 13.33 14.86 19.99 15.66
[145] 22.36 22.40 16.00 16.52 11.25 13.06  9.60 14.25 20.00 20.00 13.75 10.34
[157] 12.25 13.25 12.00 10.95 16.75 13.25 14.00 13.65 18.00 18.00 15.00 12.70
[169] 17.50 19.66 11.51 12.71 12.00 17.00 13.00  6.50 19.00 19.70 14.25  9.86
[181] 23.00 15.33 14.65 15.60 22.00 14.00 14.00 16.86 14.51 13.51 13.75 18.51
[193] 19.75 20.80 15.99 16.34 25.00 13.00 16.88 12.95 11.50 12.75 11.16 12.70
[205] 10.13 17.01 11.24 12.60 20.00 14.01 13.05 13.18 12.00 12.00 15.38 15.00
[217] 10.52 15.16  9.90 13.50 21.68 20.00 19.00 12.00 14.91 11.00 14.36 11.00
[229] 17.00 11.99 12.46 20.00 15.01 12.41 13.49 14.00 13.20 13.55 13.96 10.95
[241] 16.00 11.80 17.00 11.65 13.58 13.09 13.86 16.00 15.00 12.08 14.16 11.00
[253] 18.00 12.85 22.00 11.50 14.66 10.16 13.00  7.50 19.84 16.75 12.00 25.25
[265] 15.50 13.36 10.00 17.00 12.83 16.00 12.50 16.00  9.18 16.50 14.41 14.25
[277] 19.00 15.00 13.36 17.83 10.50 11.75 12.75 19.75 15.40 21.00 18.00 14.46

https://causeweb.org/tshs/ultra-running/

Creating a histogram

Appropriate for summarizing a set of numbers (continuous variables)

  1. Choose a bin size and a center value, e.g. one-hour bins centered at the integers would be denoted \((5.5, 6.5]\), \((6.5, 7.5]\), \((7.5, 8.5]\), etc. Bins do not overlap. Use enough bins to cover the data completely

  2. Assign each runner to a bin, e.g. 13.50 goes into the \((12.5, 13.5]\) bin and 13.51 goes into the \((13.5, 14.5]\) bin

  3. Plot bars for each bin, with the height of the bar corresponding to the number of runners in that bin
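
The three steps above can be sketched in base R. This is only an illustration: it uses the first 12 personal bests from the listing, and the object names `pb`, `breaks`, `bin`, and `counts` are chosen here rather than taken from the course code.

```r
# First 12 personal-best times (hours) from the listing above
pb <- c(14.00, 7.60, 14.20, 14.33, 17.00, 12.00,
        16.00, 16.16, 9.95, 17.55, 12.50, 23.00)

# Step 1: one-hour bins centered at the integers, enough to cover the data
breaks <- seq(floor(min(pb)) - 0.5, ceiling(max(pb)) + 0.5, by = 1)

# Step 2: assign each runner to a bin; right = TRUE gives intervals (a, b]
bin <- cut(pb, breaks = breaks, right = TRUE)

# Step 3: the bar heights are the number of runners in each bin
counts <- table(bin)
counts[counts > 0]
```

Because the intervals are closed on the right, 12.50 is counted in \((11.5, 12.5]\), matching the rule in step 2. In ggplot2, `geom_histogram(binwidth = 1, center = 7)` should request the same one-hour, integer-centered bins.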

Bin width of 10 hours – too large

Bin width of 3 minutes – too small

Density plot

  • Alternative way to summarize a continuous variable
  • Smoothed version of histogram (but amount of smoothing is adjustable)
  • \(y\)-axis is density: single connected line and area under the line equals 1
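
A minimal sketch of these points in base R, assuming the first 12 personal bests as example data. The helper `area()` is hypothetical (a simple trapezoid rule written for this illustration), and the `adjust` values are arbitrary choices.

```r
# First 12 personal-best times (hours) from the listing above
pb <- c(14.00, 7.60, 14.20, 14.33, 17.00, 12.00,
        16.00, 16.16, 9.95, 17.55, 12.50, 23.00)

d_default <- density(pb)                # default bandwidth
d_smooth  <- density(pb, adjust = 2)    # twice the default bandwidth: smoother
d_rough   <- density(pb, adjust = 0.5)  # half the default bandwidth: wigglier

# Hypothetical helper: approximate the area under a density curve (trapezoid rule)
area <- function(d) sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

area(d_default)  # close to 1
area(d_smooth)   # still close to 1: smoothing changes the shape, not the area
```

In ggplot2 the same knob is exposed as `geom_density(adjust = ...)`.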

Different amounts of smoothing

Comparison

Histogram

  • More familiar to most readers
  • \(y\)-axis is counts by default (but technically these should be densities*)
  • Requires choosing binwidth

Density plot

  • Less familiar to readers
  • \(y\)-axis is density
  • Algorithms exist to choose an appropriate amount of smoothing

Generally no reason not to show both.

*Density \(\neq\) probability (but you can think of it as relative probability)
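
To make the footnote concrete: for a histogram with equal-width bins, density = count / (n × binwidth), so that the bar areas (not the bar heights) sum to 1. The sketch below checks this in base R on the first 12 personal bests; the two-hour bin width is an arbitrary choice for illustration.

```r
# First 12 personal-best times (hours) from the listing above
pb <- c(14.00, 7.60, 14.20, 14.33, 17.00, 12.00,
        16.00, 16.16, 9.95, 17.55, 12.50, 23.00)

n <- length(pb)
binwidth <- 2
h <- hist(pb, breaks = seq(6, 24, by = binwidth), plot = FALSE)

# Rescale counts to densities by hand and compare with what hist() reports
density_by_hand <- h$counts / (n * binwidth)
all.equal(h$density, density_by_hand)

# Heights times widths: the total area of the bars is 1
sum(h$density * binwidth)
```

In ggplot2, mapping `aes(y = after_stat(density))` inside `geom_histogram()` requests the same rescaling, which puts a histogram and a density curve on a common \(y\)-axis.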

Mary E Spear

American graphic analyst for the US government for more than 30 years

Author of Charting Statistics (1952) and Practical Charting Techniques (1969)

Inventor of the ‘range bar’ (boxplot)

Fair use, https://en.wikipedia.org/wiki/File:Mary_Eleanor_Hunt_Spear_died_1986.png

Figure 7-28, Spear (1969)

Interquartile Range = segment from first quartile to third quartile

Boxplots

Sometimes called box-and-whisker plots

  1. Calculate the five-number summary: minimum, lower quartile, median, upper quartile, maximum
  2. Draw a box from lower quartile to upper quartile and line at median (box)
  3. Draw line segments (whiskers) from each edge of the box to the most extreme data point within \(1.5\times\) IQR of that edge
  4. Any points outside of these segments are plotted directly (outliers)
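
The construction can be sketched numerically in base R, again on the first 12 personal bests. This assumes the common convention that whiskers stop at the most extreme observations within \(1.5\times\) IQR of the box, and it uses R's default quantile definition; other software may compute quartiles slightly differently.

```r
# First 12 personal-best times (hours) from the listing above
pb <- c(14.00, 7.60, 14.20, 14.33, 17.00, 12.00,
        16.00, 16.16, 9.95, 17.55, 12.50, 23.00)

# Quartiles define the box and the median line
q <- quantile(pb, c(0.25, 0.50, 0.75))
iqr <- q[[3]] - q[[1]]

# Fences: the farthest a whisker is allowed to reach
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
whiskers <- range(pb[pb >= lower_fence & pb <= upper_fence])

# Anything beyond the fences is plotted individually as an outlier
outliers <- pb[pb < lower_fence | pb > upper_fence]
outliers  # 23
```

In this small subset the 23-hour time lies beyond the upper fence, so the upper whisker stops at 17.55 rather than at the maximum.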

“Jittered” points

Five-number summary

Whiskers

Enclose the box

Typical boxplot

Useful to include individual points

Random vertical jitter makes it possible to see multiple data points that share the same value

Violin plots

A more recent introduction is the ‘violin plot’: a density plot and its reflection

Easy to create and to show for multiple variables at once, like boxplots, but gives a more accurate representation of the distribution, like a density plot
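
A minimal ggplot2 sketch of a violin plot with a narrow boxplot layered on top. It assumes ggplot2 is installed, uses the first 12 personal bests as example data, and draws the violin vertically; the data frame name `runners` and the boxplot width are illustrative choices, not the settings used for the slides.

```r
library(ggplot2)

# First 12 personal-best times (hours) from the listing above
runners <- data.frame(
  pb = c(14.00, 7.60, 14.20, 14.33, 17.00, 12.00,
         16.00, 16.16, 9.95, 17.55, 12.50, 23.00)
)

# x = "" places all runners in a single (unlabeled) group
p <- ggplot(runners, aes(x = "", y = pb)) +
  geom_violin() +               # density curve and its reflection
  geom_boxplot(width = 0.1) +   # narrow boxplot adds the quartiles
  labs(x = NULL, y = "Personal best time (hours)")
p
```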

Simple violin plot

With boxplot on top

With datapoints on top

Comparison

Boxplot

  • One word
  • More familiar to readers
  • Clearly shows 1st, 2nd, and 3rd quartiles
  • Easy to draw by hand
  • Potentially misleading representation of distribution

Violin plot

  • Two words
  • Less familiar to most readers
  • Doesn’t show quartiles (but can be easily added)
  • Requires choosing bandwidth, i.e. the amount of smoothing (same as density plot)
  • More accurate representation of distribution

Generally no reason not to show both

Bar Charts

Univariate summaries of categorical variable

Can show counts or proportions

Simple Bar Chart

Stacked Bar Chart

Filled Bar Chart

Dodged Bar Chart

Stacked vs. Filled vs. Dodged

All are different flavors of bar charts

Stacked

  • Emphasizes overall count differences between bars
  • Difficult to assess subgroup count differences between bars (except first subgroup)
  • Difficult to count small categories

Filled

  • Easy to assess overall proportional differences between bars for first and last subgroups
  • Difficult to assess subgroup proportional differences between bars in interior subgroups
  • No information on counts

Dodged

  • Emphasis on count differences between bars for each subgroup
  • Difficult to assess overall count differences between bars
  • Difficult to count small categories
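
The three flavors differ only in the `position` argument of `geom_bar()`. The sketch below assumes ggplot2 is installed and uses a hypothetical toy data set: the variables `pace` and `group` and their counts are invented for illustration and are not part of the ultra-runner data.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  pace  = rep(c("fast", "medium", "slow"), times = c(4, 8, 6)),
  group = rep(c("A", "B", "A", "B", "A", "B"), times = c(3, 1, 4, 4, 1, 5))
)

base <- ggplot(toy, aes(x = pace, fill = group))

stacked <- base + geom_bar(position = "stack")  # subgroup counts stacked (default)
filled  <- base + geom_bar(position = "fill")   # every bar rescaled to height 1
dodged  <- base + geom_bar(position = "dodge")  # subgroups drawn side by side
```

Printing `stacked`, `filled`, and `dodged` in turn shows the trade-offs listed above: the total bar height is informative only in the stacked version, and the counts disappear entirely in the filled one.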

Bars vs. Histograms vs. Columns

Bar charts / Bar plots

  • Only appropriate for categorical variables
  • Bars are categories. Nothing exists between bars
  • Bars may be ordered by value or by count
  • \(y\)-axis can be counts or proportions
  • Use geom_bar()

Histogram

  • Only appropriate for continuous variables
  • Bins are intervals based on binwidth. No gap between bins
  • Bins are naturally ordered by value
  • \(y\)-axis can be counts or densities
  • Use geom_histogram()

Column plot

  • Special type of bivariate plot:
    • \(x\)-axis is categorical
    • \(y\)-axis is continuous
    • Bars instead of points
    • Bars drawn from \(y\) to 0
  • Use geom_col()
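
A short sketch of the difference between `geom_bar()` and `geom_col()`, assuming ggplot2 is installed. The `pace` variable is hypothetical toy data invented for illustration. `geom_bar()` counts rows itself, whereas `geom_col()` expects the bar heights to be supplied as a \(y\) variable.

```r
library(ggplot2)

# Hypothetical toy data: one row per observation
toy <- data.frame(pace = rep(c("fast", "medium", "slow"), times = c(4, 8, 6)))

# geom_bar(): ggplot2 counts the rows in each category
p_bar <- ggplot(toy, aes(x = pace)) + geom_bar()

# geom_col(): one row per category, with the height already computed
tallied <- as.data.frame(table(pace = toy$pace), responseName = "n")
p_col <- ggplot(tallied, aes(x = pace, y = n)) + geom_col()
```

Both plots show the same bars here; `geom_col()` becomes the natural choice when the \(y\) value is something other than a count, such as a mean per category.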

Bivariate extensions

Layering in ggplot

Adding multiple geometric objects to one ggplot object results in multiple layered views of the information.

Example of jittered points on top of boxplot:

library(ggplot2)

ggplot(ultrarunning) +
  # First create the boxplot (its own outlier markers are hidden)
  geom_boxplot(aes(x = pb100k_dec), outlier.shape = NA) +
  # Then add points along the y = 0 line with random vertical jitter
  geom_jitter(aes(x = pb100k_dec, y = 0), width = 0, height = 0.25) +
  scale_x_continuous(name = "Personal best time (hours)") +
  scale_y_continuous(breaks = NULL, name = NULL, limits = c(-1, 1)) +
  theme(text = element_text(size = 24))

Other examples of layering we’ve seen so far: layering density over histogram of running times (slide 15); data points over a boxplot (slide 24); boxplot over violin plot (slide 27)

When to layer?

Some situations when you would want to layer:

  • When single layer has critical weakness / deficiency (Avoid distorting what the data say)
  • When you want to highlight both granular and aggregate components of the data (Reveal the data at several levels of detail, from broad overview to fine structure)
  • To anchor your data (layer A) within context of reference / other data (layer B) (Encourage comparison between different pieces of data)

Code Together Task

No Spice: Make the histogram on slide 9;

Weak Sauce: Make the boxplot on slide 23;

Medium Spice: Make the layered violin plots on slides 26 or 27;

Yoga Flame: Make the layered Density+Histogram on slide 15 (hint: use after_stat to get the correct y-axis for the histogram); or the layered boxplot on slide 24 (hint: use geom_jitter instead of geom_point); or one of the bar charts on slides 32, 33, or 34 (hint: you will need to use case_when to create some character variables before making the plots);

Dim Mak: Make one of the bivariate violin plots on slides 37–39;

References

Samtleben, E., 2023. Ultrarunning dataset. Teaching of Statistics in the Health Sciences Resource Portal, Available at https://www.causeweb.org/tshs/ultra-running/.

Spear, M.E., 1969. Practical charting techniques.